Abstract

We present several theoretical contributions which allow Lie group, or continuous 
transformation, models to be fit to large high dimensional datasets. We then demonstrate 
training of Lie group models on natural video. Transformation operators are represented 
in their eigen-basis, reducing the computational complexity of parameter estimation to 
that of training a linear transformation model. A transformation specific "blurring" 
operator is introduced that allows inference to escape local minima via a smoothing 
of the transformation space. A penalty on traversed manifold distance is added which 
encourages the discovery of sparse, minimal distance, transformations between states. 



Both learning and inference are demonstrated using these methods for the full set of 
affine transformations on natural image patches. Transformation operators are then 
trained on natural video. It is shown that the learned video transformations provide 
a better description of inter-frame differences than the standard motion model, rigid 
translation. 



1 Introduction 



A fundamental problem in vision is to find compact descriptions for how images change
over time. Such descriptions may provide clues to the representation used in the brain
[Einhauser et al. 2002, Olshausen et al. 2007, Cadieu and Olshausen 2008], and they
could lead to more efficient video compression [Wiegand et al. 2003] and motion
estimation algorithms. Better characterization of the statistics of natural video would also
allow for the generation of more natural, controlled stimuli for use in psychophysical
and neurophysiological experiments [Victor et al. 2006]. Finally, an understanding
of dynamics could lead to better methods for extracting visual invariants - both form
invariants, where an object retains its form under changes in position, lighting or
occlusion [LeCun et al. 2004, Wallis and Rolls 1997, Serre et al. 2007], and transformation
invariants [Cadieu and Olshausen 2008], where the same transformation is applied to
an object independent of its form.



Motivated by the problem of recognizing form invariants, [Rao and Ruderman 1999]
introduced the idea of learning a Lie, or continuous transformation, group
representation of the dynamics which occur in the visual world. The Lie group is built
by first describing all infinitesimal transformations which an image may undergo. The
full group is then generated from all possible compositions of those infinitesimal
transformations, which allows for transformations to be applied smoothly and continuously.
A large class of visual transformations, including all the affine transformations,
intensity changes due to changes in lighting, contrast changes and spatially localized
versions of all the preceding, can be described simply using Lie group operators (although
other transformations - for instance moving occlusion boundaries - cannot be easily
described). In [Miao and Rao 2007, Rao and Ruderman 1999], the Lie group operators
were trained on image sequences containing a subset of affine transformations.
[Memisevic and Hinton 2010] train a second order restricted Boltzmann machine on pairs
of frames, an alternative technique which also shows promise for capturing temporal
structure in video.

Unfortunately, despite the simplicity and power of the Lie group representation,
training such a model is difficult, partly due to the high computational cost of
evaluating and propagating learning gradients through matrix exponentials. Previous work
[Olshausen et al. 2007, Rao and Ruderman 1999, Miao and Rao 2007] has
approximated the full model using a first order Taylor expansion, reducing the exponential
model to a linear one. While computationally efficient, a linear model approximates
the full exponential model only for a small range of transformations. This can be a
hindrance in dealing with real world data, which often contains a large range of
changes between pairs of video frames. Note that in [Miao and Rao 2007], while the
full Lie group model is used in inferring transformation parameters, only its linear
approximation is used during learning. [Culpepper and Olshausen 2010] work with a full
exponential model, but their technique requires performing a costly eigendecomposition
of the effective transformation operator for each sample and at every learning or
inference step.

Another hurdle one encounters is that the inference process, which computes
transformation coefficients given a pair of images, is highly non-convex with many local
minima. This problem has been extensively studied in image registration, stereo
matching and the computation of optic flow. For a certain set of transformations
(translation, rotation and isotropic scaling), [Kokiopoulou and Frossard 2009] showed that one
could find the global minimum by formulating the problem using an overcomplete
image representation. For arbitrary transformations, one solution is to initialize inference
with many different coefficient values [Miao and Rao 2007]; but the drawback is that
the number of initial guesses needed grows exponentially with the number of
transformations. Alternatively, [Lucas and Kanade 1981, Arathorn 2002, Vasconcelos and
Lippman 1997, Black and Jepson 1996] perform matching with an image pyramid,
using solutions from a lower resolution level to seed the search algorithm at a higher
resolution level. [Culpepper and Olshausen 2010] used the same technique to perform
learning with Lie group operators on natural movies. Such piecewise coarse-to-fine
schemes avoid local minima by searching in the smooth parts of the transformation
space before proceeding to less smooth parts of the space. However, they require that
coarsening the transformation corresponds to spatial blurring. As we show here, it is
also possible to smooth the transformation space directly, resulting in a robust method
for estimating transformation coefficients for arbitrary transformations.

In this work we propose a method for directly learning the Lie group operators 
that mediate continuous transformations, and we demonstrate the ability to robustly 
infer transformations between frames of video using the learned operators. The com- 
putational complexity of learning the operators is reduced by re-parametrizing them in 
terms of their eigenvectors and eigenvalues, resulting in a complexity equivalent to that 
of the linear approximation. Inference is made robust and tractable by smoothing the 
transformation space directly, which allows for a continuous coarse-to-fine search for 
the transformation parameters. Both learning and inference are demonstrated on the full 
set of affine transformations using a Lie group framework. The same technique is then 
used to learn a set of canonical transformations describing changes between frames
in natural movies. Unlike previous Lie group implementations [Rao and Ruderman
1999, Miao and Rao 2007, Olshausen et al. 2007, Culpepper and Olshausen 2010],
we demonstrate an ability to work simultaneously with multiple transformations and
large inter-frame differences during both inference and learning.



2 Continuous Transformation Model 



As in [Rao and Ruderman 1999], we consider the class of continuous transformations
described by the first order differential equation

$$\frac{\partial y(s)}{\partial s} = A\, y(s), \tag{1}$$

with solution

$$y(s) = e^{As}\, y(0) = T(s)\, y(0). \tag{2}$$

Here $A \in \mathbb{R}^{N \times N}$ is an infinitesimal transformation operator and the generator of the
Lie group; $s \in \mathbb{R}$ is a coefficient which controls its application; $T(s) = e^{As}$ is a
matrix exponential defined by its Taylor expansion; and $y(s) \in \mathbb{R}^{N \times 1}$ is the signal $y(0)$
transformed by $A$ to a degree controlled by $s$.

The goal of this paper is to use transformations of this form to model the changes
between adjacent frames $x^{(t)}, x^{(t+1)} \in \mathbb{R}^{N \times 1}$ in video. That is, we seek to find the model
parameters $A$ (adapted over an ensemble of video image sequences) and coefficients
$s^{(t)}$ (inferred for each pair of frames) that minimize the reconstruction error

$$E = \sum_t \left\| x^{(t+1)} - T\!\left(s^{(t)}\right) x^{(t)} \right\|_2^2. \tag{3}$$

We will later extend this to multiple transformations.



2.1 Eigen-decomposition 

To derive a learning rule for $A$, it is necessary to compute the gradient $\partial e^{As} / \partial A$. Naively
this costs $O(N^6)$ time [Ortiz et al. 2001] ($O(N^2)$ operations per element, and $N^4$
elements), making it computationally intractable for many problems of interest. However,
this computation reduces to $O(N^2)$ (the same complexity as its linear approximation)
if $A$ is rewritten in terms of its eigen-decomposition

$$A = U \Lambda U^{-1} \tag{4}$$

and learning is instead performed directly in terms of $U$ and $\Lambda$. $U \in \mathbb{C}^{N \times N}$ is a
complex matrix consisting of the eigenvectors of $A$, $\Lambda \in \mathbb{C}^{N \times N}$ is a complex diagonal
matrix holding the eigenvalues of $A$, and $U^{-1}$ is the inverse of $U$. The matrices must be
complex in order to facilitate periodic transformations, such as rotation. However, note
that $U$ need not be orthonormal. The benefit of this representation is that

$$e^{As} = I + U \Lambda U^{-1} s + \frac{1}{2} U \Lambda U^{-1} U \Lambda U^{-1} s^2 + \ldots = U e^{s\Lambda} U^{-1}, \tag{5}$$

where the matrix exponential of a diagonal matrix is simply the element-wise
exponential of its diagonal entries. This representation therefore replaces the full matrix
exponential by two matrix multiplications and an element-wise exponential.
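The identity in Equation 5 can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names `expm_eig` and `expm_taylor` and the 2-D rotation generator are illustrative assumptions. It compares the eigen-basis form of the matrix exponential with a truncated Taylor series.

```python
import numpy as np

def expm_eig(U, lam, s):
    """Return U exp(s*Lambda) U^{-1} for a generator A = U diag(lam) U^{-1}."""
    # Multiplying U by a vector scales column j by exp(s*lam[j]),
    # which is the same as U @ diag(exp(s*lam)).
    return (U * np.exp(s * lam)) @ np.linalg.inv(U)

def expm_taylor(A, s, terms=40):
    """Reference implementation: truncated Taylor series of exp(A*s)."""
    out = np.eye(A.shape[0], dtype=complex)
    term = np.eye(A.shape[0], dtype=complex)
    for k in range(1, terms):
        term = term @ (A * s) / k
        out = out + term
    return out

# Hypothetical example generator: 2-D rotation, with eigenvalues +i and -i.
A = np.array([[0.0, -1.0], [1.0, 0.0]])
lam, U = np.linalg.eig(A)
T = expm_eig(U, lam, s=0.7)   # rotation by 0.7 radians, up to rounding
```

The complex eigenvalues of the rotation generator illustrate why $U$ and $\Lambda$ must be allowed to be complex for periodic transformations.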



2.2 Adaptive Smoothing 

Unfortunately, in general the reconstruction error described by Equation 3 is highly
non-convex in $s$ and contains many local minima. To illustrate, the red solid line in Figure 1
plots the reconstruction error for a white-noise image patch shifted by three pixels to
the right as a function of transformation coefficient $s$ for a generator $A$ corresponding
to left-right translation. It is clear that performing gradient based inference on $s$ for this
error function would be problematic.

Footnote 1: This change of form enforces the restriction that $A$ be diagonalizable.
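The white-noise translation example can be reproduced in one dimension. The sketch below is an illustration under stated assumptions, not the paper's code: it uses circular (periodic) translation, for which the eigen-basis is the discrete Fourier transform and the eigenvalues are $-2\pi i f$, and the names `translate` and `recon_error` are hypothetical. The optional `sigma` argument applies the Gaussian-averaged operator discussed in this section.

```python
import numpy as np

def translate(x, mu, sigma=0.0):
    """Circular 1-D translation by mu samples, applied in the DFT eigen-basis.

    With sigma > 0 the gain becomes exp(mu*lam + 0.5*sigma^2*lam^2),
    i.e. a translation combined with a Gaussian blur along the shift direction.
    """
    lam = -2j * np.pi * np.fft.fftfreq(x.size)   # eigenvalues of the shift generator
    gain = np.exp(mu * lam + 0.5 * sigma**2 * lam**2)
    return np.fft.ifft(np.fft.fft(x) * gain).real

def recon_error(x0, x1, mu, sigma=0.0):
    """Squared reconstruction error, as in Equation 3, for a single pair."""
    return float(np.sum((x1 - translate(x0, mu, sigma)) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal(63)           # white-noise "patch" (odd length, no Nyquist bin)
x1 = np.roll(x0, 3)                    # shifted three samples to the right
grid = np.linspace(-8.0, 8.0, 161)     # candidate coefficients, step 0.1
sharp = np.array([recon_error(x0, x1, s) for s in grid])
smooth = np.array([recon_error(x0, x1, s, sigma=2.0) for s in grid])
```

Plotting `sharp` and `smooth` against `grid` gives error curves of the kind described for Figure 1; the value `sigma=2.0` is an arbitrary choice for illustration.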

To overcome this problem, we propose an alternative transformation, motivated by
modern image matching algorithms [Arathorn 2002, Vasconcelos and Lippman 1997,
Black and Jepson 1996, Lucas and Kanade 1981], that adaptively smooths the error
function in terms of the transformation coefficient. This is achieved by averaging over
a range of transformations using a Gaussian distribution for the coefficient values,

$$T(\mu, \sigma) = \int_{-\infty}^{\infty} T(s)\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\|s - \mu\|^2}{2\sigma^2}}\, ds,$$

and replacing $T(s)$ with $T(\mu, \sigma)$ in Equation 3, which is then minimized with respect
to both $\mu$ and $\sigma$ (see footnote 2).

Instead of the single best $s$ that minimizes $E$, inference using $T(\mu, \sigma)$ finds a
Gaussian distribution over $s$, effectively blurring the signal along the transformation
direction given by $A = U \Lambda U^{-1}$. In the case of translation, for instance, this averaging over a
range of transformations blurs the image along the direction of translation. The higher
the value of $\sigma$, the larger the blur. Under simultaneous inference in $\mu$ and $\sigma$, images
are matched first at a coarse scale, and the match refines as the blurring of the image
decreases.

To illustrate the way in which the proposed transformation leads to better inference,
the dotted lines in Figure 1 show the reconstruction error as a function of $\mu$ for
different values of $\sigma$. Note that, by allowing $\sigma$ to vary, steepest descent paths open out of
the local minima, detouring through coarser scales.
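Because the Gaussian average acts independently on each eigenvalue, the integral over $s$ has a closed form: the expectation of $e^{s\lambda}$ under a Gaussian with mean $\mu$ and standard deviation $\sigma$ is $e^{\mu\lambda + \frac{1}{2}\sigma^2\lambda^2}$, giving $T(\mu, \sigma) = U e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2} U^{-1}$. The sketch below checks this against brute-force quadrature; the function names and the rotation generator are assumptions made for the example.

```python
import numpy as np

def smoothed_operator(U, lam, mu, sigma):
    """Closed form of the Gaussian average of U exp(s*Lambda) U^{-1}, s ~ N(mu, sigma^2)."""
    return (U * np.exp(mu * lam + 0.5 * sigma**2 * lam**2)) @ np.linalg.inv(U)

def smoothed_operator_quadrature(U, lam, mu, sigma, n=4001, width=8.0):
    """Brute-force check: average T(s) against the Gaussian density on a grid."""
    s = np.linspace(mu - width * sigma, mu + width * sigma, n)
    w = np.exp(-0.5 * ((s - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    ds = s[1] - s[0]
    # Riemann sum of exp(s*lam_j) * density(s) for each eigenvalue lam_j.
    avg = (np.exp(np.outer(s, lam)) * w[:, None]).sum(axis=0) * ds
    return (U * avg) @ np.linalg.inv(U)

A = np.array([[0.0, -1.0], [1.0, 0.0]])   # illustrative rotation generator
lam, U = np.linalg.eig(A)
T_closed = smoothed_operator(U, lam, mu=0.5, sigma=0.8)
T_numeric = smoothed_operator_quadrature(U, lam, mu=0.5, sigma=0.8)
```

For this rotation generator the smoothed operator is a rotation by `mu` scaled by `exp(-sigma**2 / 2)`, which is one concrete way to see the blurring: averaging over a range of rotations shrinks the signal toward its rotation-invariant part.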



Footnote 2: This can alternatively be seen as introducing an additional transformation
operator, this one a smoothing operator

$$A_{\mathrm{smooth}} = \frac{1}{2} U \Lambda^2 U^{-1}, \tag{6}$$

with coefficient $s_{\mathrm{smooth}} = \sigma^2$. $A_{\mathrm{smooth}}$ smooths along the transformation direction given by
$A = U \Lambda U^{-1}$.
Figure 1: Reconstruction error (Equation 3) as a function of $\mu$ for different values of
$\sigma$. In this case the target pattern $x^{(t+1)}$ has been translated in one dimension relative to
an initial white noise pattern $x^{(t)}$, and the operator $A$ is the one-dimensional translation
operator. Local minima in terms of $\mu$ can be escaped by increasing $\sigma$.

3 Multiple Transformations 

A single transformation is inadequate to describe most changes observed in the visual 
world. The model presented above can be extended to multiple transformations by 
concatenating transformations in the following way: 

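$$T_{\mathrm{multi}}(\mu, \sigma) = T_1(\mu_1, \sigma_1)\, T_2(\mu_2, \sigma_2) \cdots = \prod_k T_k(\mu_k, \sigma_k), \tag{7}$$

$$T_k(\mu_k, \sigma_k) = U_k\, e^{\mu_k \Lambda_k}\, e^{\frac{1}{2} \sigma_k^2 \Lambda_k^2}\, U_k^{-1}, \tag{8}$$

where $k$ indexes the transformation. Note that the transformations $T_k(\mu_k, \sigma_k)$ do not in
general commute, and thus the ordering of the terms in the product must be maintained.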
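The ordered product in Equation 7 and its dependence on ordering can be illustrated with two small generators. This is a sketch under assumptions: the helper names `T_k` and `T_multi` mirror the notation above but are not from the paper, and the rotation and horizontal-scaling generators are chosen only because they are easy to verify by hand.

```python
import numpy as np
from functools import reduce

def T_k(A, mu, sigma):
    """Single smoothed transformation (Equation 8) built from a diagonalizable generator A."""
    lam, U = np.linalg.eig(A)
    return (U * np.exp(mu * lam + 0.5 * sigma**2 * lam**2)) @ np.linalg.inv(U)

def T_multi(generators, mus, sigmas):
    """Ordered product T_1 T_2 ... of the individual transformations (Equation 7)."""
    ops = [T_k(A, m, s) for A, m, s in zip(generators, mus, sigmas)]
    return reduce(np.matmul, ops)

ROT = np.array([[0.0, -1.0], [1.0, 0.0]])     # rotation generator
SCALE = np.array([[1.0, 0.0], [0.0, 0.0]])    # horizontal scaling generator

rot_then_scale = T_multi([ROT, SCALE], [0.3, 0.5], [0.0, 0.0])
scale_then_rot = T_multi([SCALE, ROT], [0.5, 0.3], [0.0, 0.0])
```

The two products differ, so swapping the order of the operators changes the result even when the coefficients are kept the same.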

Because of the fixed ordering of transformations and due to the lack of
commutativity, the multiple transformation case no longer constitutes a Lie group for most choices
of transformation generators $A$. Describing the group structure of this new model is
a goal of future work. For present purposes we note that for several obvious video
transformations - affine transformations, brightness scaling, and contrast scaling - the
accessible transformations are not restricted by the model form in Equation 7, though
the choice of coefficient values $\mu_k$ for a transformation may depend heavily on the order
of terms in the product.
4 Regularization via Manifold Distance



In order to encourage the learned operators to act independently of each other, and to
learn to transform between patches in the most direct way possible, we penalize the
distance through which the transformations move the image patch. If the initial image
patch is $y_1(0)$, this distance can be expressed as

$$d\left(T_{\mathrm{multi}}(\mu, \sigma)\, y_1(0)\right) = \sum_k d\left(T_k(\mu_k, \sigma_k)\, y_k(0)\right), \tag{9}$$

where $y_k(0) = \prod_{m<k} T_m(\mu_m, \sigma_m)\, y_1(0)$ is the image patch before application of
transformation $k$. Assuming a Euclidean metric, the distance $d\left(T_k(\mu_k, \sigma_k)\, y_k(0)\right)$ traversed
by each single transformation in the chain is

$$d\left(T_k(\mu_k, \sigma_k)\, y_k(0)\right) = \int_{\tau=0}^{\mu_k} \left\| \dot{y}_k(\tau) \right\|_2 d\tau \tag{10}$$

$$= \int_{\tau=0}^{\mu_k} \left\| A_k\, y_k(\tau) \right\|_2 d\tau \tag{11}$$

$$= \int_{\tau=0}^{\mu_k} \left\| A_k\, e^{A_k \tau}\, y_k(0) \right\|_2 d\tau. \tag{12}$$

Finding a closed form solution for the above integral is difficult, but it can be
approximated using a linearization around $\tau = \mu_k / 2$,

$$d\left(T_k(\mu_k, \sigma_k)\, y_k(0)\right) \approx \mu_k \left\| A_k\, e^{\frac{\mu_k}{2} A_k}\, y_k(0) \right\|_2. \tag{13}$$

Since this penalty is applied individually to each operator, it also acts similarly to
an L1 penalty on the path length of the transformations. This penalty will encourage
travel between two points to occur via a path described by a single transformation, rather
than by a longer path described by multiple transformations.
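The quality of the midpoint approximation in Equation 13 can be checked numerically for a small generator. The sketch below is illustrative only: the function names are hypothetical, smoothing is ignored ($\sigma = 0$), and the generator $0.3 I + J$ (a rotation combined with isotropic growth) is chosen because its path length has a simple analytic value.

```python
import numpy as np

def expm_eig(A, s):
    """exp(A*s) via the eigen-decomposition of a diagonalizable A."""
    lam, U = np.linalg.eig(A)
    return (U * np.exp(s * lam)) @ np.linalg.inv(U)

def path_length_numeric(A, y0, mu, n=2001):
    """Integrate ||A exp(A*tau) y0||_2 over tau in [0, mu] with the trapezoid rule."""
    taus = np.linspace(0.0, mu, n)
    speeds = np.array([np.linalg.norm(A @ expm_eig(A, t) @ y0) for t in taus])
    return float(np.sum(0.5 * (speeds[1:] + speeds[:-1]) * np.diff(taus)))

def path_length_midpoint(A, y0, mu):
    """Midpoint approximation: mu * ||A exp(A*mu/2) y0||_2."""
    return float(mu * np.linalg.norm(A @ expm_eig(A, 0.5 * mu) @ y0))

A = np.array([[0.3, -1.0], [1.0, 0.3]])   # illustrative generator, 0.3*I + rotation
y0 = np.array([1.0, 0.5])
numeric = path_length_numeric(A, y0, mu=1.0)
midpoint = path_length_midpoint(A, y0, mu=1.0)
# For this generator the integrand is sqrt(1.09) * exp(0.3*tau) * ||y0||.
exact = np.sqrt(1.09) * np.linalg.norm(y0) * (np.exp(0.3) - 1.0) / 0.3
```

For this example the midpoint value agrees with the integral to within about half a percent; the gap grows as the speed along the path becomes less linear in $\tau$.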



5 The Complete Model 

The full model's objective function is

$$E(\mu, \sigma, U, \Lambda, x) = \eta_n \sum_t \left\| x^{(t+1)} - T_{\mathrm{multi}}\!\left(\mu^{(t)}, \sigma^{(t)}\right) x^{(t)} \right\|_2^2 + \eta_d \sum_t \sum_k \mu_k^{(t)} \left\| A_k\, e^{\frac{\mu_k^{(t)}}{2} A_k}\, x_k^{(t)} \right\|_2 + \eta_\sigma \sum_t \sum_k \left( \sigma_k^{(t)} \right)^2 \tag{14}$$

(where a small L2 regularization term on $\sigma_k^{(t)}$ was found to speed convergence during
learning). We used $\eta_n = 1$, $\eta_d = 0.005$, and $\eta_\sigma = 0.01$. Derivatives of the energy
function are provided in the appendix. This model can be recast into a probabilistic
framework in a straightforward fashion by setting

$$p(x, \mu, \sigma \mid U, \Lambda) \propto \exp\left( -E(\mu, \sigma, U, \Lambda, x) \right). \tag{15}$$



6 Inference and Learning 

To find $U$ and $\Lambda$, we employ a variational Expectation-Maximization type of
optimization strategy, which iterates between the following two steps:

1. Load a fresh set of video patches $x$, then find optimal estimates for the latent
variables $\hat{\mu}$ and $\hat{\sigma}$ while holding the estimates of the model parameters $\hat{U}$ and $\hat{\Lambda}$
fixed,

$$\hat{\mu}, \hat{\sigma} = \arg\min_{\mu, \sigma} E\left(\mu, \sigma, \hat{U}, \hat{\Lambda}, x\right). \tag{16}$$

2. Optimize the estimated model parameters $\hat{U}$ and $\hat{\Lambda}$ while holding the latent
variables fixed,

$$\hat{U}, \hat{\Lambda} = \arg\min_{U, \Lambda} E\left(\hat{\mu}, \hat{\sigma}, U, \Lambda, x\right). \tag{17}$$

All optimization was performed using the L-BFGS implementation in Mark Schmidt's
minFunc [Schmidt 2009]. A similar optimization scheme has been used in [Lewicki
and Olshausen 1999].



There is a degeneracy in $U_k$, in that the columns (corresponding to the eigenvectors
of $A_k$) can be rescaled arbitrarily, and $A_k$ will remain unchanged as the inverse scaling
will occur in the rows of $U_k^{-1}$. If not dealt with, $U_k$ and/or $U_k^{-1}$ will random walk
into an ill conditioned scaling over many learning steps. As described in detail in the
appendix, this effect is compensated for by rescaling the columns of $U_k$ such that they
have identical power to the corresponding rows in $U_k^{-1}$.
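The rescaling step can be sketched as follows. Minimizing the summed power of column $j$ of $U R$ and row $j$ of $(U R)^{-1}$ over a diagonal $R$ gives $R_{jj}^4$ equal to the ratio of the row power of $U^{-1}$ to the column power of $U$. The function name `rebalance` and the random test matrices are assumptions for this example, not part of the paper.

```python
import numpy as np

def rebalance(U):
    """Rescale the columns of U so each has the same power as the matching row of U^{-1}."""
    U_inv = np.linalg.inv(U)
    col_power = np.sum(np.abs(U) ** 2, axis=0)       # power of each column of U
    row_power = np.sum(np.abs(U_inv) ** 2, axis=1)   # power of each row of U^{-1}
    r = (row_power / col_power) ** 0.25              # diagonal of R
    return U * r                                     # scales column j by r[j], i.e. U @ diag(r)

rng = np.random.default_rng(1)
U0 = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
lam0 = rng.standard_normal(4) + 1j * rng.standard_normal(4)
U1 = rebalance(U0)

A_before = (U0 * lam0) @ np.linalg.inv(U0)   # generator before rescaling
A_after = (U1 * lam0) @ np.linalg.inv(U1)    # generator after rescaling
```

The generator is unchanged by the rescaling, while the column powers of `U1` now match the row powers of its inverse.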
7 Experimental Results



7.1 Inference with Affine Transforms 



To verify the correctness of the proposed inference algorithm, a set of known
transformations was applied to natural image patches, and the transformation coefficients were
inferred using a set of hand-designed operators. For this purpose a pool of 1,000 11x11
natural image patches was cropped from a set of short BBC video clips. Each image
patch was transformed by the full set of affine transformations simultaneously, with the
transformation coefficients drawn uniformly from the ranges listed below (see footnote 3).



Transformation Type       Range
horizontal translation    ± 5 pixels
vertical translation      ± 5 pixels
rotation                  ± 180 degrees
horizontal scaling        ± 50%
vertical scaling          ± 50%
horizontal skew           ± 50%



The proposed inference algorithm (Equation 16) is used to recover the transformation
parameters. Figure 2 shows the fraction of the recovered coefficients which differed
by less than 1% from the true coefficients. The distribution of the PSNR of the
reconstruction is also shown. The inference algorithm recovers the parameters with a high
degree of accuracy. The PSNR in the reconstructed image patches was also higher
than 25dB for 85% of the transformed image patches. In addition, we find that adaptive
blurring significantly improved inference, as evident in Figure 2a.
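For reference, PSNR can be computed as in the sketch below. The paper does not state the peak value or pixel scaling it used, so the `peak` parameter (defaulting to 1.0 for images scaled to the unit interval) is an assumption, as is the function name.

```python
import numpy as np

def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / mean squared error)."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(peak**2 / mse))

# A uniform error of 0.1 on a unit-peak image gives an MSE of 0.01, i.e. 20 dB.
example_db = psnr(np.zeros((11, 11)), np.full((11, 11), 0.1))
```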



7.2 Learning Affine Transformations 

To demonstrate the ability to learn transformations, we trained the algorithm on image 
sequences transformed by a single affine transformation operator (translation, rotation, 
scaling or skew). The training data were single image patches from the same
BBC clips as in Section 7.1, transformed by an open source Matlab package [Shen
2008] with the same transformation range used in Section 7.1.



Footnote 3: Vertical skew is left out since it can be constructed using a combination of
rotation and scaling.

Figure 2: (a) Inference accuracy: the fraction of recovered coefficients which differed by
less than 1% from the true coefficient values. Image patches were transformed using a set
of hand coded affine transformations (all transformations simultaneously), and recovery was
performed via gradient descent of Equation 14. Inference with and without adaptive
blurring is compared. (b) Histogram of the PSNR of reconstruction: the distribution of PSNR
values for image patches reconstructed using coefficients inferred with adaptive blurring.

The affine transformation operators are derivative operators in the direction of
motion. For example, a horizontal translation operator is a derivative operator in the
horizontal direction while a rotation operator computes a derivative radially. Our learned
operators illustrate this property. Figure 3 shows two of the learned transformation
operators, where each 11x11 block corresponds to one column of $A$ and the block's
position in the figure corresponds to its pixel location in the original image patch. This
can be viewed as an array of basis functions, each one showing how intensity at a given
pixel location influences the instantaneous change in pixel intensity at all pixel
locations (see Equation 1). In this figure, each basis represents a numerical differentiator.
The bottom two rows of Figure 3 show each of the operators being applied to an image
patch. An animation of the full set of learned affine operators applied to image patches
can be found in the supplementary material.
Figure 3: The transformation operators that correspond to horizontal translation (a) and
rotation (b). Each 11x11 block corresponds to one column of $A$ and the block's position
in the figure corresponds to its pixel location in the original image patch. Each block
therefore shows how intensity at one pixel location contributes to the instantaneous
change in intensity at other pixel locations. Note that the blocks correspond to spatial
derivatives in the direction of motion. Panels (c) and (d) show the translation and rotation
operators, respectively, being applied to an image patch over a range of coefficient values $s$.

7.3 Learning Transformations for Time-Varying Natural Images 

To explore the transformation statistics of natural images, we trained the algorithm 
on pairs of 17 x 17 image patches cropped from consecutive frames obtained from a 
corpus of short videos from The BBC's Animal World Series. In order to allow the 
learned transformations to capture image features moving into and out of a patch from 
the surround, and to allow more direct comparison to motion compensation algorithms, 
the error function for inference and learning was only applied to the central 9x9 region 
in each 17x17 patch. Each patch can therefore be viewed as a 9 x 9 patch equipped 
with a 4 pixel wide buffer region. In the 15 transformation case, for computational
reasons only a 2 pixel wide buffer region was used, so the 15 transformation case acts
on 13x13 pixel patches with the reconstruction penalty on the central 9x9 region.

Figure 4: Sample transformation operators from a set of 15 transformations trained
in an unsupervised fashion on 13x13 pixel patches (including a 2 pixel buffer region)
from natural video. Each 13x13 block corresponds to one column of $A$ and the block's
position in the figure corresponds to its pixel location in the original image patch. Each
block therefore illustrates the influence a single pixel has on the entire image patch
as the transformation is applied. (a) is a full field translation operator, (b) performs
full field intensity scaling, (c) performs full field contrast scaling, and (d) is difficult to
interpret.

Training was performed on a variety of models with different numbers of transfor- 
mations. For several of the models two of the operators were pre-coded to be whole- 
patch horizontal and vertical translation. This was done since we expect that translation 
will prove to be the predominant mode of transformation in natural video, and this 
allows the algorithm to focus on learning less obvious transformations contained in
video with the remaining operators. This belief is supported by the observation that
several operators become full field translation when learning is unconstrained, as in
the 15 transformation case in Figure 5. Hardcoding translation also provides a useful
basis of comparison to existing (whole-patch translation based) motion compensation
algorithms used in video compression.

The model case with the greatest variety of transformation operators consisted of 15
unconstrained transformations. A selection of the learned $A_k$ can be found in Figure 4.
The learned transformation operators performed full field translations, intensity scaling,
contrast scaling, spatially localized mixtures of the preceding 3 transformation types,
and a number of transformations with no clear interpretation.

Animations showing a full set of learned operators acting on patches can be found 
in the supplementary materials. 

Figure 5: PSNR of the reconstruction of the second frame from 1,000 pairs of frames
from natural video using a variety of model configurations. (Plot title: average PSNR for
different models. Horizontal axis: Model. Legible legend entries: 15 learned; cont. trans +
4 learned; cont. trans + 3 learned; cont. trans + 2 learned; cont. trans + 1 learned; cont.
trans, with sigma; cont. trans, without sigma; quarter pixel trans.)

To demonstrate the effectiveness of the learned transformations at capturing the
inter-frame changes in natural video, the PSNR of the image reconstruction for 1,000
17x17 two time-step video patches was compared for all of the learned transformation
models, as well as to standard motion compensation reconstructions. The models
compared were as follows:
1. No transformation. Frame $x^{(t)}$ is compared to frame $x^{(t+1)}$ without any
transformation.

2. Full pixel motion compensation. The central 9x9 region of $x^{(t+1)}$ is compared
to the best matching 9x9 region in $x^{(t)}$ with full pixel resolution.

3. Quarter pixel motion compensation with bilinear interpolation. The central 9x9
region of $x^{(t+1)}$ is compared to the best matching 9x9 region in $x^{(t)}$ with quarter
pixel resolution.

4. Continuous translation without sigma. Only vertical and horizontal translation 
operators are used in the model, but they are allowed to perform subpixel transla- 
tions. 

5. Continuous translation with sigma. Vertical and horizontal translation operators 
are used in the model, and in addition adaptive smoothing is used. 

6. Continuous translation plus learned operators. Additional transformation opera- 
tors are randomly initialized and learned in an unsupervised fashion. 

7. 15 learned transformation operators. Fifteen operators are randomly initialized 
and learned in an unsupervised fashion. No operators are hard coded to transla- 
tion. 
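The full pixel baseline (model 2 in the list above) can be sketched as an exhaustive block search. This is a minimal illustration, not the authors' implementation: the function name `full_pixel_match`, the squared-error matching criterion, and the search range (every offset that keeps the 9x9 window inside the 17x17 patch) are assumptions.

```python
import numpy as np

def full_pixel_match(prev, nxt, core=9, buf=4):
    """Find the core x core window of `prev` that best matches the centre of `nxt`.

    Returns ((dy, dx), squared_error) for the best integer offset.
    """
    target = nxt[buf:buf + core, buf:buf + core]
    best_offset, best_err = None, np.inf
    for dy in range(-buf, buf + 1):
        for dx in range(-buf, buf + 1):
            cand = prev[buf + dy:buf + dy + core, buf + dx:buf + dx + core]
            err = float(np.sum((target - cand) ** 2))
            if err < best_err:
                best_offset, best_err = (dy, dx), err
    return best_offset, best_err

# Synthetic check: the centre of the second frame is copied from the first
# frame at offset (dy, dx) = (2, -1), so the search should recover that offset.
rng = np.random.default_rng(2)
prev = rng.standard_normal((17, 17))
nxt = rng.standard_normal((17, 17))
nxt[4:13, 4:13] = prev[6:15, 3:12]
offset, err = full_pixel_match(prev, nxt)
```

The PSNR of the matched window against the target then gives the kind of baseline figure that the continuous models are compared against.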



7.4 Reconstructing Time-Varying Natural Images 

As shown in Figure 5, there is a steady increase in PSNR as the transformation model
gets more complex. The use of the learned operators for video compression is explored
more fully in [Wang et al. 2011].



Conclusions 

We have demonstrated and tested a method for learning Lie group operators from im- 
age sequences. We made the problem computationally tractable by utilizing an eigen- 
decomposition of the operator matrix, in addition to a transformation specific blurring 
operator which removed local minima. This model was then applied to recovering 
transformation coefficients for affine transformations, learning affine transformations,
and to learning the transformations between frames in natural video. These experiments 
demonstrated that the method was effective and tractable, and that it is able to perform 
both inference and learning with many transformations on very large inter-frame trans- 
formations. We have also demonstrated that the learned models allow better description 
of the dynamics of natural movies than the standard rigid translation model. 
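A Appendix - Degeneracy in U

We decompose our transformation generator

$$A = V \Lambda V^{-1}, \tag{A-1}$$

where $\Lambda$ is diagonal. We introduce another diagonal matrix $R$. We can populate the
diagonal of $R$ with anything (non-zero) we want, and the following equations will still
hold:

$$A = V \Lambda V^{-1} \tag{A-2}$$

$$= V R R^{-1} \Lambda V^{-1} \tag{A-3}$$

$$= V R \Lambda R^{-1} V^{-1} \tag{A-4}$$

$$= (V R)\, \Lambda\, (V R)^{-1}. \tag{A-5}$$

If we set

$$U = V R, \tag{A-6}$$

then

$$A = U \Lambda U^{-1}, \tag{A-7}$$

and $R$ represents a degeneracy in the set of $U$ we are allowed to choose in our
decomposition.

We remove this degeneracy in $U$ by choosing $R$ so as to minimize the joint power
after every learning step. That is,

$$R = \arg\min_R \sum_i \sum_j |V_{ij}|^2 R_{jj}^2 + \sum_i \sum_j \left(R^{-1}\right)_{jj}^2 \left|\left(V^{-1}\right)_{ji}\right|^2 \tag{A-8}$$

$$= \arg\min_R \sum_i \sum_j |V_{ij}|^2 R_{jj}^2 + \sum_i \sum_j \frac{\left|\left(V^{-1}\right)_{ji}\right|^2}{R_{jj}^2}. \tag{A-9}$$

Setting the derivative with respect to $R_{jj}$ to zero,

$$0 = 2 R_{jj} \sum_i |V_{ij}|^2 - 2 R_{jj}^{-3} \sum_i \left|\left(V^{-1}\right)_{ji}\right|^2, \tag{A-10}$$

$$R_{jj}^4 \sum_i |V_{ij}|^2 = \sum_i \left|\left(V^{-1}\right)_{ji}\right|^2, \tag{A-11}$$

$$R_{jj}^4 = \frac{\sum_i \left|\left(V^{-1}\right)_{ji}\right|^2}{\sum_i |V_{ij}|^2}, \tag{A-12}$$

$$R_{jj} = \left[ \frac{\sum_i \left|\left(V^{-1}\right)_{ji}\right|^2}{\sum_i |V_{ij}|^2} \right]^{\frac{1}{4}}. \tag{A-13}$$

Practically this means that, after every learning step, we set

$$R_{jj} = \left[ \frac{\sum_i \left|\left(U^{-1}\right)_{ji}\right|^2}{\sum_i |U_{ij}|^2} \right]^{\frac{1}{4}} \tag{A-14}$$

and then set

$$U_{\mathrm{new}} = U R. \tag{A-15}$$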

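B Appendix - Derivatives

Let

$$\varepsilon = \sum_n \mathrm{err}(n)^T\, \mathrm{err}(n). \tag{B-1}$$

B.1 Derivative for inference with one operator

The learning gradient with respect to $\mu$ is

$$\frac{\partial \varepsilon}{\partial \mu} = -\sum_n 2\, \mathrm{err}(n)^T\, U\, \frac{\partial}{\partial \mu}\!\left[ e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2} \right] U^{-1} X_n$$

$$= -\sum_n 2\, \mathrm{err}(n)^T\, U \Lambda\, e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2}\, U^{-1} X_n,$$

where $\mathrm{err}(n)$ is the reconstruction error of the $n$th sample, that is, the difference
between the $n$th target patch and its reconstruction $U e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2} U^{-1} X_n$.

Similarly, the learning gradient with respect to $\sigma$ is

$$\frac{\partial \varepsilon}{\partial \sigma} = -\sum_n 2\, \mathrm{err}(n)^T\, U\, \sigma \Lambda^2\, e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2}\, U^{-1} X_n.$$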

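B.2 Derivative for learning with one operator

The learning gradient with respect to $\Lambda$ is

$$\frac{\partial \varepsilon}{\partial \Lambda} = -\sum_n 2\, \mathrm{err}(n)^T\, U\, \frac{\partial}{\partial \Lambda}\!\left[ e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2} \right] U^{-1} X_n.$$

It is easy to see that the derivative with respect to each diagonal element $\Lambda_{kk}$ is

$$\frac{\partial}{\partial \Lambda_{kk}}\!\left[ e^{\mu\Lambda_{kk}} e^{\frac{1}{2}\sigma^2\Lambda_{kk}^2} \right] = \left( \mu + \sigma^2 \Lambda_{kk} \right) e^{\mu\Lambda_{kk}} e^{\frac{1}{2}\sigma^2\Lambda_{kk}^2}.$$

Therefore, in matrix form, the derivative is

$$\frac{\partial \varepsilon}{\partial \Lambda} = -\sum_n 2\, \mathrm{err}(n)^T\, U \left( \mu\, e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2} + \sigma^2 \Lambda\, e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2} \right) U^{-1} X_n.$$

The learning gradient with respect to $U$ is

$$\frac{\partial \varepsilon}{\partial U} = -\sum_n 2\, \mathrm{err}(n)^T\, \frac{\partial}{\partial U}\!\left[ U e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2} U^{-1} \right] X_n$$

$$= -\sum_n 2\, \mathrm{err}(n)^T\, \frac{\partial U}{\partial U}\, e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2}\, U^{-1} X_n - \sum_n 2\, \mathrm{err}(n)^T\, U e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2}\, \frac{\partial U^{-1}}{\partial U}\, X_n.$$

Recall that

$$\frac{\partial U^{-1}}{\partial U} = -U^{-1}\, \frac{\partial U}{\partial U}\, U^{-1}. \tag{B-10}$$

The learning gradient is hence

$$\frac{\partial \varepsilon}{\partial U} = -\sum_n 2\, \mathrm{err}(n)^T\, \frac{\partial U}{\partial U}\, e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2}\, U^{-1} X_n + \sum_n 2\, \mathrm{err}(n)^T\, U e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2}\, U^{-1}\, \frac{\partial U}{\partial U}\, U^{-1} X_n.$$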

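B.3 Derivative for Complex Variables

To accommodate the complex variables $U$ and $\Lambda$, we rewrite our objective function
as

$$\varepsilon = \sum_n \overline{\mathrm{err}(n)}^T\, \mathrm{err}(n), \tag{B-11}$$

where $\overline{\mathrm{err}(n)}$ denotes the complex conjugate. The derivative of this error function
with respect to any complex variable can then be broken into the real and imaginary
parts. For $\Lambda$, for example,

$$\frac{\partial \varepsilon}{\partial \Re\{\Lambda\}} = \sum_n \left[ \overline{\frac{\partial\, \mathrm{err}(n)}{\partial \Lambda}}^T \mathrm{err}(n) + \overline{\mathrm{err}(n)}^T\, \frac{\partial\, \mathrm{err}(n)}{\partial \Lambda} \right] = 2\, \Re\left\{ \sum_n \overline{\mathrm{err}(n)}^T\, \frac{\partial\, \mathrm{err}(n)}{\partial \Lambda} \right\},$$

$$\frac{\partial \varepsilon}{\partial \Im\{\Lambda\}} = -2\, \Im\left\{ \sum_n \overline{\mathrm{err}(n)}^T\, \frac{\partial\, \mathrm{err}(n)}{\partial \Lambda} \right\}. \tag{B-12}$$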

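B.4 Derivatives in Matrix Notation

For completeness, we can write the derivatives in matrix notation as follows.

$$\frac{\partial \varepsilon}{\partial \mu} = -2 E^T U \Lambda K U^{-1} X$$

$$\frac{\partial \varepsilon}{\partial \sigma} = -2 \sigma E^T U \Lambda^2 K U^{-1} X$$

$$\frac{\partial \varepsilon}{\partial \Lambda} = -2 E^T U \left( \mu + \sigma^2 \Lambda \right) K U^{-1} X \tag{B-13}$$

$$\frac{\partial \varepsilon}{\partial U} = -2 E^T K U^{-1} X + 2 E^T U K U^{-1} U^{-1} X$$

where

$$K = e^{\mu\Lambda} e^{\frac{1}{2}\sigma^2\Lambda^2}, \tag{B-14}$$

and $E$ and $X$ are matrices with columns of $\mathrm{err}(n)$ and $X_n$ respectively.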

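B.5 Derivatives for Manifold Penalty

We have a model

$$\dot{I} = A I \tag{B-15}$$

with solution

$$I_1 = e^{As} I_0. \tag{B-16}$$

We want to find and minimize the distance traveled by the image patch $I_0$ to $I_1$
under the transformation operator $A$. The total distance is

$$\int_{t=0}^{s} \left\| \dot{I} \right\|_2 dt. \tag{B-17}$$

This then gives the following,

$$\int_{t=0}^{s} \left\| A I(t) \right\|_2 dt = \int_{t=0}^{s} \left\| A e^{At} I_0 \right\|_2 dt = \int_{t=0}^{s} \sqrt{ \left( A e^{At} I_0 \right)^T \left( A e^{At} I_0 \right) }\, dt = \int_{t=0}^{s} \sqrt{ I_0^T \left( e^{At} \right)^T A^T A\, e^{At} I_0 }\, dt. \tag{B-18}$$

We don't know how to solve this analytically. Instead we make the approximation

$$\int_{t=0}^{s} \left\| \dot{I} \right\|_2 dt \approx s \left\| A I_0 \right\|_2, \tag{B-19}$$

with derivatives

$$d \approx s \left( I_0^T A^T A I_0 \right)^{\frac{1}{2}}, \qquad \frac{\partial d}{\partial s} = \left( I_0^T A^T A I_0 \right)^{\frac{1}{2}}, \qquad \frac{\partial d}{\partial A} = s\, \frac{1}{2} \left( I_0^T A^T A I_0 \right)^{-\frac{1}{2}} 2 A I_0 I_0^T. \tag{B-20}$$

Second approximation:

$$\int_{t=0}^{s} \left\| \dot{I} \right\|_2 dt \approx s \left\| A I_{s/2} \right\|_2 = s \left\| A e^{A \frac{s}{2}} I_0 \right\|_2. \tag{B-21}$$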



The derivative is therefore 



= {l^ieWA^Ae^ilo) + if-^^A^Ae^^I, + 7o^(e^i)^A^A^/o 
= B + 2sB 

(B-22) 

where S = l'^ {e^if A^ Ae'^i 

We can check this approximation against numerical integrals 